An Overview of Windows AI Foundry: Local AI Development and Deployment
Session Date: May 19, 2025
Duration: 1 hour
Venue: Build 2025 Conference - BRK223
Speakers: Tucker Burns (GPM, Windows Platform + Developer Team), Dian Hartono (Product Manager Lead, Windows Developer Platform Team)
Link: Microsoft Build 2025 Session BRK223
Table of Contents
- Introduction: The Case for Local AI
- Windows ML: The Foundation Layer
- Windows AI APIs: Ready-to-Use AI Capabilities
- Customization and Fine-Tuning
- Foundry Local: Open-Source Model Ecosystem
- AI Workstations and Hardware Optimization
- Call to Action and Resources
- Appendix
- References
Introduction: The Case for Local AI
Timeframe: 00:00:30 - 7m 15s
Speakers: Tucker Burns, Dian Hartono
Why Local AI Matters
Tucker Burns opens the session by establishing the fundamental premise that we are at a turning point for local AI: running models exclusively in the cloud is not always optimal. With the miniaturization of AI models and advances in silicon capability, client-side AI processing has become increasingly viable and, in many scenarios, necessary.
The speakers identify four key drivers for local AI adoption:
1. Compliance and Data Privacy
The evolving regulatory landscape presents specific challenges that on-device execution directly addresses:
- GDPR (General Data Protection Regulation) requirements for data sovereignty
- HIPAA (Health Insurance Portability and Accountability Act) compliance in healthcare
- DMA (Digital Markets Act) data handling regulations
- Complete data control - models run without data ever leaving the device
2. Performance and User Experience
Local processing offers significant advantages:
- Zero network latency - critical for real-time applications
- Sensor proximity benefits - audio, video, and input processing optimizations
- High availability - operations without internet dependency
- Cost optimization - reduced cloud infrastructure requirements
3. Power Efficiency and Background Processing
Modern hardware enables new capabilities:
- NPU utilization - Neural Processing Units designed for efficient AI workloads
- Proactive processing - models running continuously without performance impact
- New experience classes - background AI capabilities enabling innovative features
4. Hybrid Flexibility
The ultimate goal is flexibility between deployment models:
- Cloud-only applications for massive scale requirements
- Local-only applications for privacy and performance
- Hybrid approaches leveraging the best of both worlds
Windows AI Foundry Platform Overview
Tucker Burns introduces Windows AI Foundry as Microsoft’s comprehensive solution for local AI development. The platform is built on three foundational pillars:
Windows AI Foundry Architecture
├── Windows AI APIs: Ready-to-use inbox models
├── Foundry Local: Open-source model catalog and execution
└── Windows ML: Foundation layer for custom models
This three-tier approach provides developers with complete flexibility in how they implement AI capabilities, from simple API calls to complex custom model deployment.
Windows ML: The Foundation Layer
Timeframe: 00:07:45 - 18m 30s
Speakers: Tucker Burns (primary), Dian Hartono (supporting)
Public Preview Announcement
Tucker Burns announces the public preview availability of Windows ML, marking a significant milestone in Microsoft’s local AI strategy. Windows ML serves as the foundational layer upon which both Windows AI APIs and Foundry Local are built.
Core Technical Capabilities
ONNX Runtime Integration:
- Industry-standard model execution framework
- Cross-silicon compatibility (CPU, GPU, NPU)
- Support for PyTorch models and Hugging Face catalog
- Out-of-box runtime reducing application binary size
Hardware Acceleration Strategy:
- Intel - CPU and integrated graphics optimization
- AMD - Ryzen processors and Radeon GPU acceleration
- NVIDIA - CUDA and Tensor Core utilization
- Qualcomm - NPU-optimized execution for Snapdragon platforms
Developer Experience and Tooling
The Windows ML development workflow integrates seamlessly with existing developer tools:
AI Toolkit for VS Code Integration
- Model Conversion - PyTorch to ONNX or silicon-specific formats
- Optimization Scripts - Pre-built optimizations for common architectures
- Custom Optimization - Starter templates for specialized model tuning
- Quality Evaluation - Built-in model testing and validation
- Application Integration - Windows ML NuGet package integration
Implementation Pattern
```
// Core Windows ML Implementation Workflow
1. Reference Microsoft.AI.MachineLearning package
2. Download execution providers via downloadPackagesAsync()
3. Register execution providers with runtime
4. Set device selection policy (CPU/GPU/NPU)
5. Compile model for target hardware
6. Execute inference with optimized performance
```
Live Demo: AI Dev Gallery
Tucker demonstrates the AI Dev Gallery, a Microsoft Store application that showcases Windows AI Foundry capabilities. The demo focuses on image classification using a custom model.
Device Policy Demonstration
The demo highlights Windows ML’s intelligent hardware selection capabilities:
Manual Control Options:
- Explicit CPU execution
- Explicit GPU execution
- Explicit NPU execution
Smart Policy Options:
- Max Efficiency - Prioritizes power-efficient execution
- Max Performance - Optimizes for speed and throughput
- Minimize Power - Balances performance with battery life
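Windows ML's policy layer sits on top of ONNX Runtime's execution-provider mechanism. As a rough illustration of the manual-control path (not the Windows ML API itself, whose managed surface is .NET), here is a minimal sketch using the onnxruntime-node package; the model path and the 1x3x224x224 input shape are assumptions:
```typescript
// Sketch: explicit execution-provider selection via onnxruntime-node,
// standing in for the ONNX Runtime concepts that Windows ML wraps.
import * as ort from 'onnxruntime-node';

async function classify(modelPath: string, pixels: Float32Array) {
  // Manual control: prefer DirectML (GPU, Windows builds), fall back to CPU.
  // Windows ML's smart policies (Max Efficiency, etc.) automate this choice.
  const session = await ort.InferenceSession.create(modelPath, {
    executionProviders: ['dml', 'cpu'],
  });

  // Wrap the preprocessed image in a tensor and run inference.
  const input = new ort.Tensor('float32', pixels, [1, 3, 224, 224]);
  const results = await session.run({ [session.inputNames[0]]: input });
  return results[session.outputNames[0]].data; // raw class scores
}
```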
Technical Implementation Details
The demo reveals the underlying code structure:
```
// Key Implementation Steps Demonstrated
1. Reference infrastructure package
2. Download execution providers (via Microsoft Store)
3. Register execution providers with runtime
4. Set device selection policy
5. Compile model for target hardware
6. Execute inference with optimized model
```
Developer Testimonials
The session includes video testimonials from early adopters highlighting quantified benefits:
- “5x faster development” - Luyan Zhang, Filmora
- Weeks to days deployment timeline reduction
- Cross-platform compatibility with minimal code changes
- Simplified ISV integration reducing time-to-market
Windows AI APIs: Ready-to-Use AI Capabilities
Timeframe: 00:26:15 - 15m 45s
Speakers: Dian Hartono (primary), Tucker Burns (supporting)
Inbox Model Integration
Dian Hartono explains that Windows AI APIs are built on top of inbox models distributed through Windows Update and optimized for Copilot+ PCs. These APIs provide abstraction layers so developers don’t need to manage underlying models directly.
Distribution and Management Architecture
- Windows Update delivery - Models distributed via OS updates
- Copilot+ PC optimization - Enhanced performance on new hardware
- API abstraction layer - Consistent interface regardless of underlying model versions
- WinApp SDK integration - Native integration with Windows development frameworks
API Portfolio and Capabilities
The comprehensive API suite covers both vision and language processing:
Vision APIs
- Image Super Resolution - Intelligent image scaling and enhancement
- Image Segmentation - Background removal and object isolation
- Object Erase - Selective content removal from images
- Image Description - Natural language image analysis and captioning
- Text Recognition (OCR) - Text extraction from images and documents
Language APIs
- Text Generation - Phi Silica-powered content creation
- Conversation Summarization - Key point extraction and meeting summaries
- Content Moderation - Automatic content safety and compliance checking
Live Demo: Image Description API
Dian demonstrates the Image Description API through the AI Dev Gallery, showcasing the complete developer workflow from exploration to implementation.
Development Integration Process
- API Capability Check - Verify model availability on device
- Model Instantiation - Create API instance with required parameters
- API Invocation - Execute model inference with input data
- Result Processing - Handle structured output from AI model
- Visual Studio Export - Direct integration into development projects
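In code, that workflow looks roughly like the sketch below. Every name here is a hypothetical stand-in (the actual Windows AI APIs ship as WinRT/C# surfaces in the WinApp SDK); the sketch only mirrors the steps listed above:
```typescript
// Hypothetical shape of the capability-check -> instantiate -> invoke flow.
// None of these identifiers are the real WinApp SDK surface.
interface ImageDescriber {
  describeAsync(image: ArrayBuffer): Promise<{ text: string }>;
}
declare const windowsAI: {
  imageDescription: {
    isAvailable(): Promise<boolean>;        // capability check on this device
    ensureReadyAsync(): Promise<void>;      // model delivery via Windows Update
    createAsync(): Promise<ImageDescriber>; // model instantiation
  };
};

async function describe(image: ArrayBuffer): Promise<string> {
  if (!(await windowsAI.imageDescription.isAvailable())) {
    await windowsAI.imageDescription.ensureReadyAsync();
  }
  const describer = await windowsAI.imageDescription.createAsync();
  const result = await describer.describeAsync(image); // API invocation
  return result.text;                                  // result processing
}
```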
Practical Demonstration Results
The demo includes a humorous comparison between human and AI image description:
Human Description (Tucker): “A simple one-bedroom apartment”
AI Model Output: “The image shows a simple, minimalistic floor plan of a small apartment or studio with a kitchenette, one bedroom, and one bathroom.”
This comparison demonstrates the AI’s superior attention to detail and comprehensive analysis capabilities.
Customization and Fine-Tuning
Timeframe: 00:42:00 - 12m 20s
Speakers: Dian Hartono (primary), Tucker Burns (supporting)
LoRA Fine-Tuning
Dian introduces Low-Rank Adaptation (LoRA) as a lightweight method for customizing inbox models. LoRA allows developers to “nudge” models toward specific domains, tones, or organizational communication styles without requiring full model retraining.
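Concretely, LoRA freezes the pretrained weights and trains only a low-rank update, which is why adapters are cheap to train and small to ship (this is the standard formulation from the LoRA literature; r is a small rank hyperparameter):
```latex
% LoRA: learn a low-rank decomposition of the weight update, keep W_0 frozen.
W = W_0 + \Delta W = W_0 + B A,
\qquad B \in \mathbb{R}^{d \times r},\;
A \in \mathbb{R}^{r \times k},\;
r \ll \min(d, k)
```
Only (d + k) · r parameters are trained instead of d · k, so an adapter can "nudge" a large inbox model without retraining it.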
Use Cases for LoRA Fine-Tuning
- Company Voice Optimization - Align output with organizational communication style
- Workflow Specialization - Adapt models for specific business processes
- Technical Terminology - Enhanced understanding of domain-specific language
- Tone Adjustment - Modify model personality and response style
Knowledge Retrieval (RAG)
Retrieval-Augmented Generation (RAG) enables models to ground their responses in private, local knowledge bases.
RAG Implementation Benefits
- Semantic Search Powered - Intelligent information retrieval from local data
- Private Knowledge Grounding - Answers based on proprietary documents and content
- Dynamic Content Handling - Real-time access to changing information
- Multi-Modal Support - Text, image, and document integration
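The underlying pattern is simple: embed the question, rank local documents by similarity, and prepend the best matches to the prompt. A minimal sketch follows; embed() and generate() are placeholders for whatever local embedding and language models an app uses, not a real API:
```typescript
// Sketch of the RAG pattern: retrieve locally, then ground the prompt.
type Doc = { text: string; vector: number[] };

const cosine = (a: number[], b: number[]) => {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i]; na += a[i] ** 2; nb += b[i] ** 2;
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
};

async function answer(question: string, corpus: Doc[],
                      embed: (t: string) => Promise<number[]>,
                      generate: (p: string) => Promise<string>) {
  const q = await embed(question);
  // Semantic search: rank local documents by similarity, keep the top 3.
  const context = corpus
    .map(d => ({ d, score: cosine(q, d.vector) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, 3)
    .map(x => x.d.text)
    .join('\n');
  // Ground the model's answer in the retrieved private content.
  return generate(`Answer using only this context:\n${context}\n\nQ: ${question}`);
}
```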
Live Demo: Fine-Tuning Workflow
Dian demonstrates the complete fine-tuning workflow using AI Toolkit for VS Code, focusing on a feedback categorization use case.
LoRA Training Process
- Project Creation - Define fine-tuning objectives and model selection
- Dataset Preparation - Training and test data for custom scenarios
- Azure Integration - Cloud-based training with local model deployment
- Evaluation and Testing - Quality assessment through AI Dev Gallery
- Production Deployment - Seamless integration into applications
Practical Example: Feedback Categorization
Input: “This app is awesome, but it needs a better Get Started icon”
Before Fine-Tuning: Generic response handling
After LoRA Adapter: Properly categorized as “Compliment + Feature Request”
This demonstrates how fine-tuning enables models to understand nuanced business contexts and provide more accurate, actionable insights.
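For context, the adapter's training pairs for this scenario might look like the sketch below. The schema is an assumption; the session did not show the actual dataset, only the one example above:
```typescript
// Illustrative training pairs for a feedback-categorization LoRA adapter.
// Shape and labels are hypothetical.
const trainingData = [
  { prompt: 'This app is awesome, but it needs a better Get Started icon',
    completion: 'Compliment + Feature Request' },
  { prompt: 'Crashes every time I open a PDF',
    completion: 'Bug Report' },
  { prompt: 'Love the new dark mode!',
    completion: 'Compliment' },
];
```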
Foundry Local: Open-Source Model Ecosystem
Timeframe: 00:54:20 - 8m 40s
Speakers: Tucker Burns (primary), Dian Hartono (supporting)
Platform Capabilities
Tucker introduces Foundry Local as Microsoft’s solution for running open-source language models locally on Windows. The platform integrates with Azure AI Foundry while supporting extensible model catalogs.
Key Features
- Azure AI Foundry Integration - Pre-optimized models for local execution
- Cross-Platform Compatibility - CPU, NPU, GPU support across hardware vendors
- Extensible Platform - Support for multiple model catalogs beyond Azure
- Simple Installation - winget install Microsoft.FoundryLocal for immediate access
Command-Line Interface
Foundry Local provides developer-friendly command-line tools for model management:
```shell
# Core Commands Demonstrated
foundry model list                      # Check available models for current hardware
foundry model run <name>                # Download and run model locally
foundry model cache                     # Check cached models on device
foundry model run phi-4-mini-reasoning  # Interactive chat mode
```
SDK and API Integration
The platform enables seamless migration from cloud to local inference:
```javascript
// Cloud Endpoint Configuration
const cloudEndpoint = "https://api.azure.com/openai";
const cloudModel = "phi-4-reasoning";

// Local Endpoint with Minimal Code Changes
import { FoundryLocalManager } from 'foundry-local-sdk';
const manager = new FoundryLocalManager();
const localEndpoint = await manager.getEndpoint("phi-4-mini");
```
Live Demo: Local Model Execution
Tucker demonstrates real-time local model execution with reasoning capabilities:
Query: “Tucker has one computer. There are four total. How many does Dian have?”
Model Response: Step-by-step logical reasoning with a final answer
Hardware: Local NPU execution with real-time processing
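Driving such a model from application code stays close to the migration snippet above. A sketch, assuming the local endpoint speaks the OpenAI-compatible protocol implied by the session's migration story (getEndpoint follows the session's own snippet; the API key is a placeholder a local server would typically ignore):
```typescript
// Sketch: chat with a Foundry Local model via an OpenAI-compatible client.
import OpenAI from 'openai';
import { FoundryLocalManager } from 'foundry-local-sdk';

const manager = new FoundryLocalManager();
const localEndpoint = await manager.getEndpoint('phi-4-mini-reasoning');

const client = new OpenAI({ baseURL: localEndpoint, apiKey: 'local' });
const reply = await client.chat.completions.create({
  model: 'phi-4-mini-reasoning',
  messages: [{
    role: 'user',
    content: 'Tucker has one computer. There are four total. How many does Dian have?',
  }],
});
console.log(reply.choices[0].message.content);
```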
Resource Optimization Benefits
- Single Instance Sharing - Multiple applications share one model copy
- Automatic Hardware Optimization - Performance tuning for specific silicon
- Storage Efficiency - Reduced disk usage across applications
- Intelligent Memory Management - Automatic loading and unloading of models
AI Workstations and Hardware Optimization
Timeframe: 01:03:00 - 2m 15s
Speakers: Dian Hartono
Dian briefly introduces Windows AI Workstations as optimized hardware solutions for AI development. She explains that AI model performance is shaped by the entire lifecycle: training, fine-tuning, and inferencing.
Hardware Selection Strategy
- GPU Selection - Optimized for training and intensive inference workloads
- NPU Utilization - Efficient background processing and real-time inference
- Cloud Integration - Hybrid approaches combining local and cloud resources
- Development Workflow - Hardware matching specific development stages
The AI Workstation Booth at Build 2025 provides hands-on experience with optimized hardware configurations.
Call to Action and Resources
Timeframe: 01:05:15 - 1m 45s
Speakers: Dian Hartono
Dian concludes with specific actionable steps for developers:
Immediate Next Steps
- Explore Documentation - Visit aka.ms/WindowsAI for comprehensive resources
- Provide Feedback - Share scenarios and requirements with the Windows AI team
- Attend Build Sessions - Multiple Windows AI Foundry breakouts, labs, and booth visits
- Install Tools - AI Dev Gallery, AI Toolkit for VS Code, Foundry Local
Community Engagement
- Email Feedback - Direct communication with Windows AI team for scenarios and requirements
- Build Booth Visits - AI Workstation demonstrations and expert consultations
- Hands-on Labs - Practical experience with fine-tuning and customization
Appendix
Technical Terminology Reference
ONNX (Open Neural Network Exchange): Industry-standard format for representing machine learning models, enabling interoperability between different AI frameworks.
NPU (Neural Processing Unit): Specialized processor designed specifically for AI workloads, offering superior power efficiency compared to general-purpose processors.
LoRA (Low-Rank Adaptation): Fine-tuning technique that adapts pre-trained models to specific tasks by learning low-rank decompositions of weight updates.
RAG (Retrieval-Augmented Generation): AI technique that combines language generation with information retrieval from external knowledge sources.
Session Production Details
Technical Setup: Live demonstrations required coordination between AI Dev Gallery, Visual Studio Code with AI Toolkit, and Foundry Local command-line interface.
Demo Coordination: Both speakers coordinated seamlessly during live demos, with Dian leading Windows AI APIs demonstrations and Tucker handling Windows ML and Foundry Local sections.
Audience Interaction: The session included spontaneous moments like the French Bulldog vs. Bernese Mountain Dog correction and the apartment description comparison, demonstrating genuine AI capability assessment.
Industry Context and Competitive Landscape
The session positioned Windows AI Foundry within the broader context of local AI development trends, emphasizing Microsoft’s unique approach of providing multiple pathways (APIs, custom models, open-source integration) rather than forcing developers into a single solution.
References
Official Microsoft Documentation
Windows AI Documentation - Comprehensive developer resources including API references, implementation guides, and best practices. Essential starting point for developers implementing Windows AI Foundry capabilities.
AI Toolkit for VS Code - Visual Studio Code extension providing model conversion, optimization, and fine-tuning capabilities. Critical tool for the complete Windows ML development workflow demonstrated in the session.
Windows ML Documentation - Technical documentation for the Windows ML platform, including API references and implementation patterns shown during the live demonstrations.
Development Tools and Resources
AI Dev Gallery (Microsoft Store) - Sample application showcasing Windows AI Foundry capabilities. Provides hands-on exploration of APIs and exportable project templates as demonstrated by both speakers.
Windows ML NuGet Package - Core runtime package for Windows ML development. Essential for implementing the device policies and cross-silicon optimization features shown in Tucker’s demonstrations.
WinApp SDK - Windows application development SDK including Windows AI APIs integration. Provides the abstraction layer for inbox models discussed by Dian Hartono.
Industry Standards and Frameworks
ONNX (Open Neural Network Exchange) - Industry-standard model format underlying Windows ML’s cross-platform compatibility. Critical for understanding the model conversion and optimization processes demonstrated in the session.
Hugging Face Model Hub - Open-source model catalog referenced as a source for Windows ML model integration. Demonstrates the platform’s commitment to open-source ecosystem support.
Regulatory and Compliance Resources
GDPR Compliance Guide - European Union data protection regulation referenced as a key driver for local AI processing. Relevant for understanding privacy benefits of on-device model execution.
HIPAA Security Rule - Healthcare data protection requirements mentioned as compliance driver for local AI. Important context for healthcare applications of Windows AI Foundry.
Hardware and Silicon Partners
Intel AI Developer Resources - Documentation for Intel CPU and GPU optimization supported by Windows ML cross-silicon strategy.
Qualcomm AI Development - NPU development resources for Snapdragon platforms supported by Windows AI Foundry’s hardware acceleration approach.